Practical tradeoffs between memory, compute, and performance in learned optimizers

arXiv.org Artificial Intelligence

Optimization plays a costly and crucial role in developing machine learning systems. In learned optimizers, the few hyperparameters of commonly used hand-designed optimizers, e.g. Adam or SGD, are replaced with flexible parametric functions. The parameters of these functions are then optimized so that the resulting learned optimizer minimizes a target loss on a chosen class of models. Learned optimizers can both reduce the number of required training steps and improve the final test loss. However, they can be expensive to train, and once trained can be expensive to use due to computational and memory overhead for the optimizer itself. In this work, we identify and quantify the design features governing the memory, compute, and performance trade-offs for many learned and hand-designed optimizers. We further leverage our analysis to construct a learned optimizer that is both faster and more memory efficient than previous work.

Despite the huge computational costs associated with training large neural models, the set of optimization algorithms used to train them has largely been restricted to simple update functions mapping from gradients to parameter updates. These algorithms typically depend on a small number of hand-designed features and parameters. However, the last decade in machine learning research has repeatedly seen small, hand-designed models outperformed by parameterized models (such as neural networks) trained to purpose on large amounts of data (LeCun et al., 2015). Thus, a promising direction to improve training performance and reduce costs is to replace hand-designed optimizers with more expressive learned optimizers, trained on problems similar to those encountered in practice.
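To make the contrast between a hand-designed update and a learned one concrete, the following is a minimal sketch in plain Python. It is not the architecture studied in the paper: the function names (`sgd_update`, `init_theta`, `learned_update`), the choice of input features (gradient and a momentum value), the network size, and the placeholder weights are all assumptions made for illustration.

```python
import math


def sgd_update(param, grad, lr=0.1):
    """Hand-designed rule: the only tunable quantity is the learning rate."""
    return param - lr * grad


def init_theta(hidden=4):
    """Hypothetical meta-parameters of a tiny MLP (2 features -> hidden -> 1).

    The values are arbitrary placeholders; in practice they would be meta-trained.
    """
    return {
        "w1": [[0.1 * (i + 1), -0.05 * (i + 1)] for i in range(hidden)],
        "b1": [0.0] * hidden,
        "w2": [0.1] * hidden,
        "b2": 0.0,
    }


def learned_update(param, grad, momentum, theta, step_scale=0.1):
    """Learned rule: the step is the output of a small network applied per parameter."""
    features = [grad, momentum]
    hidden = [
        math.tanh(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(theta["w1"], theta["b1"])
    ]
    step = sum(w * h for w, h in zip(theta["w2"], hidden)) + theta["b2"]
    return param - step_scale * step


# One step on f(x) = x**2 from x = 1.0, where the gradient is 2 * x.
theta = init_theta()
x, grad, momentum = 1.0, 2.0, 0.0
x_sgd = sgd_update(x, grad)
x_learned = learned_update(x, grad, momentum, theta)
```

Even in this toy form, the learned rule runs a network forward pass for every parameter and needs extra per-parameter state (here a momentum value) as input, which is the kind of compute and memory overhead the abstract refers to.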
Learned optimizers specify parameter update rules using a flexible parametric form and learn the parameters of this function from a "dataset" of optimization tasks, a procedure typically referred to as meta-training or meta-learning (Andrychowicz et al., 2016; Finn et al., 2017; Hochreiter et al., 2001). Learned optimizers represent a path towards improved optimizer performance, and possess the ability to target different objectives. Despite being an active area of research (Andrychowicz et al., 2016; Wichrowska et al., 2017; Chen et al., 2020; Metz et al., 2020b; 2021; Almeida et al., 2021; Zheng et al., 2022), they are not yet commonly used in practice. Several challenges have limited the widespread application of learned optimizers: they are typically difficult to meta-train on a task family of interest, they can require significant memory and compute overhead when applied, and they often generalize less well to novel tasks than hand-designed optimizers.
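The meta-training procedure described above can be sketched as two nested loops: an inner loop that trains a task with the current update rule, and an outer loop that adjusts the rule's parameters to lower the resulting loss averaged over tasks. The example below is a deliberately reduced illustration, not the method used in the paper. The "learned optimizer" has a single meta-parameter (a log step size), the task family is a set of hypothetical one-dimensional quadratics, and the meta-gradient is estimated with central finite differences for simplicity. All names (`unroll`, `meta_loss`, `meta_train`) are assumptions.

```python
import math
import random


def unroll(log_lr, curvature, x0=1.0, steps=10):
    """Inner loop: minimize f(x) = 0.5 * curvature * x**2 and return the final loss."""
    lr = math.exp(log_lr)
    x = x0
    for _ in range(steps):
        grad = curvature * x
        x = x - lr * grad
    return 0.5 * curvature * x * x


def meta_loss(log_lr, tasks):
    """Meta-objective: final training loss averaged over the task set."""
    return sum(unroll(log_lr, c) for c in tasks) / len(tasks)


def meta_train(tasks, log_lr=math.log(0.01), meta_lr=0.5, meta_steps=200, eps=1e-4):
    """Outer loop: gradient descent on the meta-parameter.

    The meta-gradient is a central finite-difference estimate (an assumption
    made to keep the sketch dependency-free).
    """
    for _ in range(meta_steps):
        g = (meta_loss(log_lr + eps, tasks) - meta_loss(log_lr - eps, tasks)) / (2 * eps)
        log_lr -= meta_lr * g
    return log_lr


rng = random.Random(0)
# "Dataset" of optimization tasks: quadratics with different curvatures.
tasks = [rng.uniform(0.5, 2.0) for _ in range(16)]
# Tasks not seen during meta-training, drawn from the same family.
held_out = [rng.uniform(0.5, 2.0) for _ in range(16)]

trained_log_lr = meta_train(tasks)
print("meta-train loss:", meta_loss(trained_log_lr, tasks))
print("held-out loss:  ", meta_loss(trained_log_lr, held_out))
```

Because the held-out tasks here come from the same family as the meta-training tasks, this sketch says nothing about generalization to genuinely novel tasks, which is one of the challenges listed above.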